Network Working Group A. Dalela
Internet Draft Cisco Systems
Intended status: Standards Track December 30, 2011
Expires: June 2012
Datacenter Solution Approaches
draft-dalela-dc-approaches-00.txt
Status of this Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html
This Internet-Draft will expire on June 30, 2012.
Copyright Notice
Copyright (c) 2011 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document.
Abstract
There are many approaches to addressing virtualized datacenter
scaling problems. Examples of these approaches include L2 vs. L3
forwarding, host-based vs. network-based solutions, fat-access and
lean-core vs. fat-core and lean-access, flat addressing vs.
encapsulation, protocol learning vs. directories for location
discovery, APIs vs. protocols for orchestration, etc. Different
solutions being proposed today take one or more of these approaches
in combination, although sometimes the question of approach itself
may not be settled. Given the multiple facets of the datacenter
problem, and the many approaches to solving each of them, it becomes
hard to discuss a solution when some approaches may be acceptable
while others are not. This document discusses the pros and cons of
the various approaches. The goal is not to describe a specific
solution, but to evaluate the approaches themselves. The document
concludes with a set of recommendations on which approaches are best
suited to a holistic solution to the entire problem set.
Table of Contents
1. Introduction
2. Conventions used in this document
3. Terms and Acronyms
4. Problem Statement
5. Possible Solution Approaches
   5.1. Addressing Approaches
      5.1.1. Mobile IP Approach
      5.1.2. Two Address Spaces
      5.1.3. Host Based Solutions
      5.1.4. Hierarchical Addressing
   5.2. Multi-Tenancy Approaches
      5.2.1. VLAN Based Approaches
      5.2.2. GRE Encapsulation
      5.2.3. MPLS Header
   5.3. Datacenter Interconnectivity Approaches
      5.3.1. BGP MPLS VPN Approach
      5.3.2. New Routing Protocol at Datacenter Edge
      5.3.3. L2 Overlay Interconnects
      5.3.4. Common Intra and Inter Datacenter Technology
   5.4. Forwarding Approaches
      5.4.1. L3 Forwarding
      5.4.2. L2 Forwarding
      5.4.3. Hybrid Approaches
   5.5. Discovery Approaches
      5.5.1. Protocol Based Route Learning
      5.5.2. Address Location Registries
      5.5.3. Routing-Registry Hybrid Approach
   5.6. Cloud Control Approaches
      5.6.1. Application APIs
      5.6.2. Network Protocol Approach
6. Recommendations
7. Network Architecture
8. Security Considerations
9. IANA Considerations
10. Conclusions
11. References
   11.1. Normative References
   11.2. Informative References
12. Acknowledgments
1. Introduction
The problem statement [REQ] describes a set of problems that need to
be collectively solved for datacenters. Many of these problems are
inter-linked, and a solution to one problem that overlooks the others
makes those other problems harder to solve. Any approach adopted to
solve the datacenter problems should therefore be evaluated against
the wider set of issues that need to be collectively addressed,
rather than against one issue at a time.
Given a broader set of issues, this document tries to evaluate the
various solution approaches against those issues. The goal here is
not to propose a specific solution, but to understand the pros and
cons of taking an approach with respect to the wider problem set.
We conclude this document with a set of recommendations on the
approaches that can be used in combination to address the entire
problem set. These can then be used to devise specific solutions,
and the discussion of those solutions will not need to re-open
questions about the approach itself.
2. Conventions used in this document
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC-2119 [RFC2119].
In this document, these words will appear with that interpretation
only when in ALL CAPS. Lower case uses of these words are not to be
interpreted as carrying RFC-2119 significance.
3. Terms and Acronyms
NA
4. Problem Statement
This is described in the problem statement document [REQ].
5. Possible Solution Approaches
This section discusses the various design approaches that can be
adopted to solve the datacenter issues. These include approaches to
solving mobility, inter-connectivity of datacenters, handling of
multiple paths to a destination, cloud orchestration, etc.
5.1. Addressing Approaches
Addressing issues primarily arise due to mobility, and secondarily
because of connecting public and private domains which might be using
the same IP address range. Both issues are important for datacenters.
5.1.1. Mobile IP Approach
In the Mobile IP approach a mobile node is assigned a location
independent address whose routes are advertised by the Home Agent.
The mobile node itself is bound at the link-level to the Foreign
Agent. The traffic is then tunneled between Home and Foreign agents.
The challenge here is that all packets must pass through the Home
Agent (at least when going towards the mobile node), so traffic
cannot use the shortest path or multiple paths to the destination.
Shortest paths and multiple paths to a destination are essential
requirements for datacenter traffic. The mobile IP approach is
therefore unsuited for datacenter traffic.
5.1.2. Two Address Spaces
Many current approaches separate the location address space from the
identifier address space. The location address space refers to the
routers or switches while the identifier address space refers to
hosts. The mapping between the location and identifier address spaces
can be done by carrying host-routes within the native routing
protocol, by a new routing protocol that carries host routes over the
native protocol or by snooping existing protocol packets like ARP.
Subsequently, packets are tunneled to the locator switch or router
identified by the outer address, decapsulated, and forwarded to the
host.
As VMs move, location independence requires host-level locator-
identifier bindings to be pushed into the network.
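As an illustration of such bindings, the sketch below shows a
minimal locator-identifier table at a network edge, written in
Python for exposition. All names in it are illustrative assumptions,
not taken from any specific proposal.

   # Minimal sketch of locator-identifier separation at an edge
   # switch. All names are illustrative.

   bindings = {}    # identifier (host IP) -> locator (switch IP)

   def vm_moved(host_ip, new_locator):
       # When a VM moves, its new binding must be pushed to the
       # edges (or everywhere, in the native-protocol case).
       bindings[host_ip] = new_locator

   def encapsulate(packet, dst_host_ip):
       # Tunnel the packet to the locator currently hosting the
       # destination; the far end decapsulates and delivers.
       locator = bindings[dst_host_ip]    # a per-host entry
       return {"outer_dst": locator, "inner": packet}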
If these bindings are pushed everywhere using the native routing
protocol, they will be present in both the access and the core. The
first bottleneck in this case will be in the core, which has to hold
many host routes. As the number of hosts grows, this approach cannot
scale in the network core.
If however these bindings are created at the edges, using a new
routing protocol that runs between edges or by snooping existing
protocol packets (such as ARP), the locator-identifier bindings are
only present at the network edges and not in the core. This approach
is obviously an improvement over the native routing protocol
approach.
However, as host mobility increases, and the corresponding hosts are
placed in different locations, the host routes at the edge begin to
increase rapidly. For example, if a host has 25 VMs, each with 4
virtual NICs, an access switch connects to 48 such hosts, and each
virtual NIC corresponds with 50 other NICs situated in different
locations, the total number of host routes needed at the access will
be 25 * 4 * 50 * 48 = 240,000.
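This arithmetic can be checked directly. The short Python sketch
below uses only the illustrative numbers from the example above;
they are not measured values.

   vms_per_host     = 25
   vnics_per_vm     = 4
   peers_per_vnic   = 50    # remote NICs each virtual NIC talks to
   hosts_per_switch = 48

   host_routes = (vms_per_host * vnics_per_vm *
                  peers_per_vnic * hosts_per_switch)
   print(host_routes)       # 240000 entries at one access switch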
This number obviously depends on the application and network design.
In some cases, a host may correspond with thousands of hosts, but may
not be virtualized. In other cases, the number of VMs per physical
host may be higher, although they don't have as many virtual
interfaces.
In the worst case, all the above numbers could be higher.
Also note that the host routes are in addition to other things: (a)
network routes, (b) local host-port bindings, (c) policies such as
access lists, etc., which currently exist and will continue.
Regardless of whether the number of host routes is large or not, what
is undeniable is that these are additional entries.
Experience shows that network sizes grow at an exponential rate, and
the VM density per host, the distribution of compute across multiple
nodes, and VM mobility are trends that will only increase with time.
Each of these factors will increase host routes. The expectation is
also that massively scaled datacenters should decrease the overall
cost of infrastructure. The cost of compute will decrease as mobility
and distribution are applied but the cost of the network will
increase with growing table sizes. This puts compute and network at
opposite ends of the cost trend, and long-term this is not viable.
Encapsulated packets make the application of security and QoS
policies a little harder. The firewall, load-balancer, application
optimizer, packet policers, or other kinds of network services have
to be aware that packets have to be analyzed based upon the inner
addresses and not based on the destination switch or router address.
This is particularly true when the same destination has hosts
belonging to many tenants each with different policies. This fact
complicates the design of all network services, and may make existing
hardware accelerated network service equipment obsolete. Different
encapsulation techniques are further incompatible with each other and
with network services that might be separately deployed.
5.1.3. Host Based Solutions
Address space separation can be achieved in the host instead of the
network. For instance, it is possible that a host is aware of two IP
addresses, one that it exposes to the network and the other that it
exposes to the applications. When an application needs to send a
packet to another application, it would use the other application's
address. But, the host operating system below the application will
map the application address to the remote host address.
This scheme becomes very intuitive with VMs. Now, a remote host is
identified by the IP address of its hypervisor while the application
is identified by the VM. When a VM sends a packet, the hypervisor
will add its own IP as the outer IP. It will also resolve the
location of the remote application to a remote hypervisor's IP
through a new protocol, and forward the packet. The network has a
static IP configuration and is unaware of the existence of VMs. Since
any VM can be on any hypervisor, VMs are location independent.
Since each VM will periodically ARP for its destination, these ARPs
also need to be trapped by the hypervisor (or the virtual switch
inside the hypervisor). The switch can respond locally through a
cache or emit another protocol query to a mapping database.
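The sketch below illustrates this trap-and-map behavior in the
virtual switch. It is a minimal Python sketch; the mapping database
and all names in it are assumptions for exposition, and no specific
protocol is implied.

   LOCAL_HYPERVISOR_IP = "192.0.2.1"        # illustrative
   MAPPING_DB = {"10.1.1.5": "192.0.2.7"}   # stands in for the new
                                            # mapping protocol
   arp_cache = {}            # VM (application) IP -> hypervisor IP

   def resolve_remote(vm_ip):
       # Stand-in for a protocol query to the mapping database.
       return MAPPING_DB[vm_ip]

   def on_vm_arp(vm_ip):
       # Trap the VM's ARP; answer from cache or query the mapping
       # database instead of broadcasting into the network.
       if vm_ip not in arp_cache:
           arp_cache[vm_ip] = resolve_remote(vm_ip)
       return arp_cache[vm_ip]

   def on_vm_packet(inner_packet, dst_vm_ip):
       # Add the local hypervisor IP as the outer IP and forward.
       return {"outer_src": LOCAL_HYPERVISOR_IP,
               "outer_dst": on_vm_arp(dst_vm_ip),
               "inner": inner_packet}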
Since the rest of the network is unaware of the existence of the VM,
the difficulties described above with respect to network services
appear here as well. Security, for example, may have to be
implemented in a hypervisor-based firewall. There are additional
overheads in processing each packet and adding/removing headers.
Since all this happens on the host CPU, a greater percentage of CPU
time is spent on network processing. This is more expensive because
hardware accelerated network processing will do that at much greater
speeds and with much lower amounts of energy consumed.
Another disadvantage of doing network functions in the host is that
the total number of network devices to manage grows by a few orders
of magnitude. For example, if this was applied to firewall management
of each tenant's personal VM firewall, the total number of firewalls
to be managed will be very high (on the order of the number of
physical hosts). The operator cannot have a single consolidated view
of all the firewall rules in one place, and if additional rules had
to be installed, they would need to be propagated to many firewalls.
5.1.4. Hierarchical Addressing
IP addressing is already hierarchical, so by this we mean the use of
hierarchical MAC addresses. A hierarchical MAC address has "network
bits" and "host bits" just like an IP address. The boundary between
the network and host portions could be fixed or variable.
As an example, a hierarchical MAC's higher-order bits could represent
a "switch-id" while the lower-order bits could represent the "host-
id". Given that a MAC address has 48 bits as compared to the 32 bits
in the IPv4 network, the use of hierarchical MAC addresses implies
that a datacenter cloud could be many times larger than the IPv4
Internet! If packets are forwarded using hierarchical MAC addresses,
it brings L3 scaling properties to L2 networks.
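As a concrete illustration, the Python sketch below assumes a fixed
24/24 split between switch-id and host-id. The split is an
assumption for exposition only; as noted above, the boundary could
be fixed or variable.

   HOST_BITS = 24     # assumed split: 24 switch bits, 24 host bits

   def make_mac(switch_id, host_id):
       # Compose a 48-bit hierarchical MAC.
       return (switch_id << HOST_BITS) | host_id

   def switch_of(mac):
       # Forwarding looks only at the switch-id prefix, so tables
       # scale with the number of switches, not hosts or VMs.
       return mac >> HOST_BITS

   mac = make_mac(0x0A0B0C, 0x000042)
   assert switch_of(mac) == 0x0A0B0C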
Note that L2 networks already have mobility. In current L2, this
mobility is based on a fixed MAC address whose location has to be
detected through conversation learning on the L2 switches before
packets are forwarded. Learning and broadcast in the network,
however, make current L2 networks unscalable. Hierarchical MACs
solve this issue. It is now not necessary to learn the full MAC
address but only the higher-order bits. If the higher order bits
represent switch-ids, then this learning never needs to be changed
unless a switch is added or removed from the network. The total
number of hardware entries anywhere in the network equals the total
number of switches and remains agnostic of VM mobility.
Note that the switch-id entries take the place of network routing
entries and can be treated as such. Therefore the total number of
entries required is nearly the same as that required currently for
static L3 routing. The host still has two addresses (IP and MAC),
but now the identifier is the IP and the locator is the MAC.
5.2. Multi-Tenancy Approaches
Depending on the type of forwarding (L2 or L3) different types of
multi-tenant segmentation can be applied. As described in the problem
statement [REQ] both L2 and L3 segments can have issues.
5.2.1. VLAN Based Approaches
Of course there are only 4096 VLANs, and therefore this approach
can't scale to many tenants. Further, a customer may need more than
one VLAN and may extend these VLANs from private domains.
To allow these scenarios, extensions of VLAN such as Q-in-Q could be
used. The inner Q could represent the VLAN and the outer Q the
customer, allowing 4096 customers each with the full range of 4096
VLANs. This should accommodate each customer's VLANs, but it may not
support enough customers in a cloud.
We might then use Q-in-Q-in-Q to segment customers into customer
classes (such as gold, silver, bronze, etc.). Alternately, we can
treat the 36 bits as a contiguous VLAN space that can be allocated to
users on demand. The latter has the issue that a mapping between
private and public VLAN spaces will need to be done at the network
edges. For instance, if a private VLAN 10 corresponds to a public
VLAN 100, then a mapping between 10 and 100 must be maintained at
the edge and the packet must be rewritten in both directions. The
total number of such mappings may not be very high, and these may be
distributed over many Provider Edge (PE) or Customer Edge (CE)
routers.
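The Python sketch below illustrates the edge translation just
described, using the VLAN 10 <-> 100 example from the text. Real
mappings would be provisioned per customer attachment.

   to_public  = {10: 100}      # private VLAN -> public VLAN
   to_private = {pub: priv for priv, pub in to_public.items()}

   def toward_provider(vlan):
       return to_public[vlan]    # rewrite tag on egress to the core

   def toward_customer(vlan):
       return to_private[vlan]   # rewrite tag back on ingress

   assert toward_customer(toward_provider(10)) == 10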
5.2.2. GRE Encapsulation
The GRE key is 32 bits long and can therefore support a very large
number of customer segments. However, GRE only works with L3
forwarding, because the GRE header is carried inside an IP packet.
This segmentation scheme has no scaling problems, except that to
support IP mobility the mobility schemes themselves require
encapsulation, which has the challenges described above. The net
result of this scheme is that two headers are required - one for
segmentation (GRE) and another for mobility. Since this runs over
L3, all L2 information (such as the VLAN) would be lost.
5.2.3. MPLS Header
MPLS has been used in the Internet to segment flows. The MPLS label
is 20 bits long, which can support over a million (2^20) customer
segments. Note that each customer could use the full range of 4096
VLANs as well, so this does not overlap the L2 segments with tenant
segments. This scheme works equally well with L2 and L3 networks,
and affords a sufficient amount of scale in both cases.
This scheme can also be used to give per-customer quality of service
or other types of policies as the packets traverse the Internet. It
is this ability to use MPLS labels across private, public and
internet domains that makes it a very convenient option.
This segment can be inserted in the packet at the access layer inside
the datacenter (similar to how VLAN tags are inserted) and removed at
the remote access layer. For remote connectivity with single tenant
datacenters, the tenant id could be inserted and removed at the
Customer Edge (CE) router. The cloud datacenter would transparently
pass the packet into the datacenter and remove the tenant id at the
access.
The segmented packets can be transported over L2 VPNs. The
authenticated VPN tunnel endpoints should be used to map (and drop)
packets whose endpoint addresses don't match with the segment. The L2
VPN could, for example, be an EoMPLS VPN whose MPLS label stack
should
be matched against the tenant identifier in the Ethernet packet. The
cloud can be treated as one more "site" for the cloud customer and
MPLS VPN services can be extended to these customers.
5.3. Datacenter Interconnectivity Approaches
Three broad approaches are possible for datacenter interconnectivity.
First, push the datacenter routes into the Internet and let the
Internet determine the right location of a host. Second, the location
is determined at the edge of the datacenter, and packets are
transported over the Internet but the mechanisms within and between
the datacenters are different. Third, we use an overlay scheme
between datacenter edges, but a common mechanism is used within and
between datacenters. These approaches are described below.
5.3.1. BGP MPLS VPN Approach
This approach uses flat addressing, which has traditionally been
used for site-to-site connectivity. In this approach, the intranet
routes are pushed into the Internet through BGP. Routing (unicast and
multicast) between the sites is handled by the Internet core.
However, traditionally there have been no mechanisms to support VM
mobility. This mobility can cause address fragmentation and bloat the
forwarding tables in the Internet. The advantage of this approach is
that bandwidth and security are guaranteed.
While this approach is not the preferred mechanism, in some cases
(such as the Virtual Private Cloud, where an entire subnet is
reserved for a customer at the provider's site) it might be used.
Ideally, in this scenario, VM mobility would be restricted to within
the site. If the subnet is provided by the customer itself, then the
customer could potentially move the entire subnet from one provider
to another in case of disaster (assuming that the services are
recreated in the new location through automated schemes). The edge
router at the new location would advertise the routes to the entire
subnet and packets would be transparently routed.
5.3.2. New Routing Protocol at Datacenter Edge
In this approach, a new routing protocol would propagate the routes
of the moving hosts between the edge routers. Once routes to a host
are known at the edge routers, the packets would be encapsulated into
an IP header with the destination address of the destination router.
This is similar to separating the identifier and locator address
spaces as described for intra-datacenter mobility earlier.
Before the location can be propagated via the routing protocol, it
must first be detected. This has to be achieved using conversation
learning. This learning could be based on traditional L2
learning, some variation of L2, or by running the new routing
protocol end-to-end within and between datacenters. In each of these
cases, some packet from the host with its IP address must be seen on
the network. Once the host has been detected, its location can be
propagated. If a host has never spoken, its location would not be
known and the host is unreachable. This problem is avoided in L2
networks where a source will broadcast ARP to force a response from a
destination that is otherwise not sending any packets. Conversation
learning of the host location is therefore absolutely necessary.
This shows that L3 location-independent schemes must use L2-type
conversation learning. The encapsulation scheme in the L3 case may
be different, but the basic mechanisms are identical in the two
cases.
The routing tables need to be segmented into VRFs to identify
different tenants. If two sites of a customer are connected to two
sites of a provider, collectively these four sites form a VRF. The
peers in one VRF will be different from the peers in another VRF.
If the protocol uses conversation learning to advertise routes, it
needs to know ahead of time which VRF an IP should be
advertised into. This is because the IP across these VRFs might be
duplicated. That means that the VRF advertisements must depend on how
the packets are segmented inside the datacenter.
For instance, the VLANs, GRE keys or MPLS labels as described above
should be mapped to VRFs. Since hosts are dynamically detected,
location propagation from intra-datacenter to inter-datacenter must
incorporate the segment as well. Similarly, traffic received from a
far-end must also carry the appropriate segmentation identifier
(e.g., a GRE key, an MPLS label, or some route identifier in the
header) to identify that the packet belongs to a particular VRF.
If datacenters are relatively static, the signaling demands at the
edge (to program new locator-identifier bindings) may be no worse
than DNS resolution that is employed infrequently to resolve the name
to IP binding before sending packets. The entries at the edge would
be long-lived. However, if the datacenters are very dynamic and lots
of resources are rapidly created this can become an overhead. Such
issues may also arise in case of disaster recovery or site outage
when resources are rapidly recreated in another site.
The forwarding plane scale needs for inter-datacenter connectivity
are identical to that in intra-datacenter encapsulation schemes. That
is, host route entries are required for host mobility across sites.
In the inter-datacenter case, because of fewer edge points, these
entries will be concentrated at fewer points, and will require higher
capacity routers at the edge. Note that inter-datacenter mobility is
a key use-case in "follow the sun" models.
Inter-datacenter connectivity also needs to build multicast
distribution trees into the edge routers. This will require
approaches similar to PIM in the intra-datacenter case. Note that
these trees may need to be optimized for workload placement, such
that the tree routes packets directly between the sites that have
the most clients for a given multicast group.
5.3.3. L2 Overlay Interconnects
In some cases, it is necessary to span the VLAN across sites. For
example, a web-server and application server may be located at one
site while the database server and the storage are in another site.
The application and database servers are within a VLAN.
If the VLAN is spanned across multiple sites, there is a need to
control broadcasts at the edges. For example, this may involve
using the discovered IP to MAC bindings to respond to periodic ARP
broadcasts. Similar to multicast trees, VLAN spanning also involves
construction of broadcast trees. And similar to how multicast routes
are propagated between intra and inter-datacenter, a single per-VLAN
spanning tree needs to be constructed for broadcast. The multicast
and broadcast trees need to be aware of workload density between
sites to optimize the broadcast and multicast traffic.
There are significant challenges related to virtual MAC overlap when
connecting multiple datacenters. Note that virtual MACs are assigned
administratively and these can overlap when many sites are connected,
especially when private and public domains that cross administrative
boundaries are connected. These overlaps will cause traffic loss.
The scaling issues with the L2 schemes are identical to those seen
within the datacenter or for L3 inter-datacenter interconnects. That
is, host routes are required for VM mobility. In fact, with L2 the
scaling is worse, because L2 addresses can't be summarized like L3
addresses. There will always be a per-MAC entry even if the entire
subnet is located at one site.
5.3.4. Common Intra and Inter Datacenter Technology
This approach treats multiple interconnected datacenters as one huge
domain. The interconnection between sites must of course take place
over the L3 Internet, but the networking technology can just treat
that as an overlay. That is, the remote location is determined
according to intra-datacenter forwarding, and tunneled over L3. The
scaling properties of this approach are identical to the scaling
properties of the various intra-datacenter approaches.
For example, if encapsulation is used within the datacenter for
mobility, and there are N switches in the first datacenter and M
switches in the second, then the first datacenter will need M
mappings between remote switch addresses and the edge locator switch
address, while the second datacenter will need N such mappings.
This is much better than using a different technology within and
between datacenters. For example, by extending the encapsulation
scheme we don't need host routes, but only switch routes. This is a
few orders of magnitude more scalable at the edge. But note that if
both datacenters are large, it may worsen the scaling at the access
because a host in one location is talking to multiple hosts in
another location. The encapsulation approach scales well in the core
and this is true when the core includes a tunnel over the Internet.
Similarly, if hierarchical MAC addresses were assigned within the
datacenters, and the switch-ids across datacenters are mutually
exclusive, then these two datacenters can be treated as one large
datacenter. Each datacenter will need to store M and N bindings at
the edge, similar to the encapsulation case above. This scheme scales
well both at the access, in the core, and at the datacenter edges.
While there are many advantages in using the same technology across
datacenters, there can be challenges in managing these administrative
domains in the same way. For instance, switch-ids across these
networks must be non-overlapping. These problems are no worse than if
different approaches are employed within and between datacenters
because one has to ensure unique MAC and IP addressing anyway.
Hierarchical addressing in fact reduces the overhead from unique host
MACs to unique switch IDs. Protocols that assign switch-ids
uniquely would further reduce the overhead to unique IP addresses
only.
5.4. Forwarding Approaches
Industry opinion is divided on this, and a lot has already been
said about it. What can we add here? We are not going to repeat what
has already been said, but will make two additional points.
First, datacenter traffic includes not just TCP/IP but also Fibre
Channel and InfiniBand. These technologies were developed at a time
when Ethernet did not provide high speeds. Now that Ethernet gives
10G and 40G speeds, it is no longer necessary to maintain separate
networks. These networks can be converged over L2 or L3, and this is
an important consideration in deciding the right approach.
Maintaining multiple parallel networks isn't practical.
Second, there are scaling issues in L2 when the network size grows,
aside from the issue that inter-VLAN (L3) traffic does not use ECMP,
which constrains the cross-section bandwidth across VLANs. These
scaling issues should be taken into account in deciding an approach.
5.4.1. L3 Forwarding
Datacenters have a significant amount of non-TCP/IP traffic. In
fact, bandwidths on these links have traditionally been much higher
than Ethernet (which is the reason these technologies were designed:
Ethernet could not deliver those speeds earlier). The bandwidth gap
no longer exists, but it is important to continue supporting these
technologies. Fibre Channel (FC) is used for SAN while InfiniBand
(IB) is used for networked IPC. FC is used in most enterprise
networks while IB is used in High Performance Computing (HPC)
clusters.
Mechanisms to converge non-TCP/IP traffic over TCP/IP have been
developed. These mechanisms have two broad types of issues. First,
if TCP/IP runs in software, the overheads of TCP/IP consume a lot of
CPU and deliver lower performance. Second, if TCP/IP runs in
hardware, the cost of the NIC is very high, given the complexity of
doing TCP in hardware. The cost/performance of the TCP/IP based
solutions is not at the desired level for FC and IB traffic types.
However, if a provider does not have significant FC/IB traffic, or
is prepared to bear the cost of more expensive NICs, then TCP/IP
based solutions - such as iSCSI for FC and iWARP for IB - can also
be employed.
As already discussed L3 scales very well but does not natively
support mobility. Encapsulations need to be used to support mobility
but these create significant scaling issues at the access.
5.4.2. L2 Forwarding
L2 forwarding simplifies network storage and IPC. Ethernet can be
used to converge TCP/IP, FC and IB traffic onto the same physical
link at the desired levels of cost and performance. This will lead to
a reduction in the datacenter networking costs, by eliminating
multiple types of NICs, cables and switches. The total number of
ports can also be reduced, increasing port utilization.
However, to support non-TCP/IP traffic, L2 networks also need to
support Datacenter Bridging (DCB) specifications. These include
Congestion Notification, per-priority flow control (per-priority
VL), and DCBX. These changes
require hardware changes at the access and may not be preferred in
the short run. Providers may prefer to use L3 in the short run.
Traditional L2 forwarding further brings several scaling issues.
First, when packets cross VLAN boundaries, they must use a default
gateway. Inter-VLAN traffic passes through this default gateway,
which means that it cannot use multiple paths to a destination. As
the inter-VLAN traffic grows, the chance of packet drops increases,
because this traffic cannot use multi-paths to the destination.
Second, traditional L2 forwarding requires each MAC address to be
learnt, and that is a scaling concern, especially in the core. This
problem can be addressed by encapsulating packets into remote
locators, only so long as the datacenter is not connected to the L3
internet. When a datacenter is connected to L3 internet and hosts can
be accessed from outside, per-host IP to MAC bindings are needed at
the datacenter edge. This negates the benefits of encapsulation in
the core, because the core needs per-host L3-L2 mappings.
Third, if we solve the inter-VLAN traffic problem by distributing the
default gateway across many such devices (to enable multi-path), it
requires all the switches at the L2-L3 boundary to learn about all
the IP-MAC bindings. Effectively, now we have multi-path but the
original scaling problems with L2 are back because each network point
in the core needs to know the MAC-IP binding for each host.
Fourth, the problem of ARP broadcast in a VLAN and STP turning off
ports is well-known. However note that ARP and STP are separate
issues from the above scaling issues, which will exist even when STP
is off or if ARP scaling issues have been addressed.
5.4.3. Hybrid Approaches
Hybrid approaches bring L3 routing algorithms to L2. These turn off
STP and enable multi-paths. However, this does not address the
mobility problem. In the L2 network, this implies learning all MAC
addresses in the core. To avoid this, encapsulation can be used,
which simplifies the core, but makes the access much worse.
Hierarchical MAC addresses can solve these scaling problems. They
don't need encapsulation and hence they address scaling problems
arising from host mobility at both access and in the core.
Hierarchical MACs create a global address space for MAC addresses.
Hence, these packets can cross VLAN boundaries easily. The trick
required here is not to tag unicast packets with VLAN tags (L2
multicast and broadcast packets must still be tagged with VLAN tags).
The packets must however be marked with the appropriate tenant id of
choice. The packet will be forwarded to destination using the MAC
address and matched against the allowed tenant id on the destination
port. The packet will be dropped at the destination port if the
tenant ids at the source and destination ports do not match.
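The Python sketch below illustrates this tenant-id check at the
destination port. The port names and tenant ids are illustrative
provisioning, and the tag format is left abstract.

   allowed_tenant = {"eth1/1": 7, "eth1/2": 9}    # provisioned

   def deliver(port, packet):
       # Unicast forwarding used only the hierarchical MAC; the
       # tenant id is checked only at the destination port.
       if packet["tenant_id"] != allowed_tenant[port]:
           return None          # drop: tenant ids do not match
       return packet            # same tenant: deliver to the host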
When a L2 datacenter has to be connected to the L3 Internet, L2-L3
mappings are required at the datacenter-Internet boundary. This is
because inside the datacenter packets are switched based on MAC
addresses while outside they are routed based on L3. This requires
per-host entries to map each host IP to their hierarchical MAC, with
one important difference. The difference is that these entries are
required only for the north-south traffic and hence don't need to be
present at every core switch. These per-host entries can therefore be
distributed over multiple core switches, each of which advertises a
per-tenant set of IP routes to the PE router. The default gateway for
all internet routes can be pinned on one of the core routers and this
will allow the distribution of L2-L3 entries.
Note that these L2-L3 mappings will be created through ARP broadcast
when hosts in datacenter converse with Internet hosts. If these
conversations are few, then the L2-L3 entries are correspondingly
reduced. The key mechanism for scaling however is distribution over
multiple core switches, which will work in all cases.
5.5. Discovery Approaches
Two broad discovery approaches are proposed today. First, address
discovery can be based on traditional routing protocols that push the
address location into the network. This has the potential of causing
instability due to frequent device creation and mobility. Second,
address discovery can be pushed into a central registry, from where
it can be pulled or pushed on a need basis. This approach bypasses
updating everywhere and can update only select locations.
5.5.1. Protocol Based Route Learning
A traditional routing protocol will carry each subnet or individual
host route at the control plane and propagate its location. The
location would be known everywhere through the control plane and this
can be programmed in hardware. We have seen that none of the
host-route approaches scale at the forwarding plane. Individual
route updates are also heavy on the control plane. In fact, frequent
updates due to link toggling, resource creation and deletion,
mobility, etc. will create serious convergence issues in the
network.
Traditional L3 networks have been based on static subnets that don't
change frequently. This helps in scaling the network and keeping it
converged. This property of networks needs to be preserved, although
the challenge with L3 is mobility and scaling issues with mobility.
5.5.2. Address Location Registries
An address or subnet is discovered (through conversation learning or
static configuration) and propagated into a registry, along with the
address of the location to reach it. Any network node that has to
send traffic to this address can look up the registry to find the
address location before transmission. Once looked up, the location
can be cached for a long period of time. This has the advantage that
it serves information on-demand. The disadvantage is that when the
information changes, everyone will not be aware of the change. They
will therefore continue to forward packets to the old location, and
the packets will be black-holed. If however, every network entity is
made aware of the change immediately through an update upon change,
then this becomes similar to the routing update above.
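The Python sketch below illustrates such a pull-and-cache lookup.
The registry contents and the cache lifetime are illustrative
assumptions; as noted above, a stale cache black-holes traffic until
it is refreshed or updated.

   import time

   REGISTRY = {"10.0.0.5": "locator-3"}   # address -> location
   cache = {}                             # long-lived local cache
   CACHE_TTL = 3600.0                     # illustrative lifetime

   def lookup(addr):
       entry = cache.get(addr)
       if entry and time.monotonic() - entry[1] < CACHE_TTL:
           return entry[0]          # served locally, no signaling
       location = REGISTRY[addr]    # on-demand pull
       cache[addr] = (location, time.monotonic())
       return location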
There are sometimes concerns expressed about learning the routes
real-time after the packet arrives. In general, the number of such
lookups is of the same order of magnitude as DNS lookups which are
done at the host level. The signaling overheads are not therefore
significant per se, if the flows are all legitimate.
The difference between host DNS lookup and the real-time route lookup
is that no packets are being sent before DNS lookup whereas a large
packet burst could be sent in the route lookup case. The burst cannot
be forwarded until a route has been received. This is a potential
security issue, if users send spurious bursts to non-existent IP
addresses. The router will buffer the packets and send queries which
will fail. Meanwhile, legitimate packets would have been queued up
and will result in tail drop. Spurious IP scanning attacks can be
launched to try and reach non-existent addresses. These attacks can
be used to significantly load the control plane as well.
5.5.3. Routing-Registry Hybrid Approach
In the hybrid approach, a static routing protocol is applied to
discover all network routes, while host locations are resolved
through a Registry as in the Registry-based approach. In the
hierarchical MAC approach, the network routes form a route table of
switch-ids.
Packets are to be forwarded based upon these network routes. The
trigger for location discovery is however tied to the ARP request.
The ARP request must be trapped at the access switch and forwarded
to a central Registry. The difference here is that the trigger for
the Registry query is not the arrival of data traffic, but the
arrival of an ARP request. This approach mimics the DNS behavior
more accurately
because during an ARP request, no packets are being sent. Note that
this solution will work only in a L2 network.
While IP scanning attacks will load the control plane with location
discovery, there is no issue of tail drops. Further, more
sophisticated control plane mechanisms can be employed to detect
such IP scans, since the triggers are control plane messages.
When the VM moves, two possible schemes can be adopted. First, the
new MAC address can be flooded to all corresponding hosts, via a
Gratuitous ARP. The access switches will trap the Gratuitous ARP and
create a binding to the new location. If we are using hierarchical
MACs then bear in mind that many hosts will reject a Gratuitous ARP
to avoid MAC hijacking. This is thus not an optimal solution.
Second, a temporary redirect entry at the earlier source may be
installed to redirect packets from the old to the new location. Note
that the ARP cache will be refreshed by each host periodically
(typically 15-30 seconds), so the redirect is not permanent. The
registry owns the installation of the temporary redirect. This
creates a sub-optimal routing path for a short period of time, but it
avoids the heavy control plane traffic to update every new source
with the new location. In time, every host will ARP for the
destination again and will learn about the new location. The
temporary redirect can therefore be removed after 15-30 seconds,
which is the time within which we can expect the host to re-ARP.
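The Python sketch below illustrates this redirect lifecycle. The
30-second lifetime mirrors the ARP refresh window cited above; the
names and structure are illustrative assumptions only.

   import time

   redirects = {}   # at the old location: MAC -> (new loc, expiry)

   def on_vm_move(mac, new_location):
       # Installed by the registry when the VM moves.
       redirects[mac] = (new_location, time.monotonic() + 30.0)

   def forward_at_old_location(mac, packet):
       entry = redirects.get(mac)
       if entry and time.monotonic() < entry[1]:
           return ("redirect", entry[0], packet)   # short detour
       return ("local", None, packet)   # after re-ARP, no redirect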
This solution solves the sudden control plane burst on a move, but it
introduces the problem that ARPs have to be periodically forwarded to
the registry to resolve. This isn't a scaling problem for the router
control plane, but a scaling issue for the central registry. Note
that ARP volume on an L2 network can be huge. Forwarding ARPs to a
central registry therefore needs to be handled with care. Of course,
this central registry can itself be a load-balanced cluster of many
nodes that share the data between them. That way, the ARP load can
be dynamically addressed as the scale increases.
5.6. Cloud Control Approaches
Cloud control comprises several functions, including discovery
within and across sites, orchestration of resources, debugging and
analytics. There are two broad approaches to cloud control. First,
the cloud control is built with application level APIs, such as HTTP
based web-services (SOAP or REST). Second, the cloud control is
embedded as a network protocol and closely tied to other network
functions. These approaches are discussed below.
5.6.1. Application APIs
An application API is a client-server model of communication. These
APIs when run over HTTP have the advantage that they can cross
firewalls. They are easy to implement and directly expose developer
level constructs for software programming.
There are however some limitations in the use of APIs. First, every
API projects the application view of information into the network
(the packet format is constructed from the API format). In the longer
term, this means APIs will generally not interoperate, because of
semantic and syntactical differences. If we converge upon a single
API standard, services deployed using existing APIs will not work.
Second, APIs as client-server constructs don't facilitate discovery,
which depends on broadcast and solicitation, prior to knowing the IP
or DNS of the endpoints. Third, APIs don't facilitate transactions
with the ability to commit or cancel in case of failures. APIs don't
give the ability to ask questions half-way through a transaction or
cancel a transaction mid-way. An API may hang and closing the
connection may result in leaked resources. Fourth, APIs don't
facilitate a policy control at the network edges which is very
important when connecting private and public domains or two public
domains. Fifth, it is harder to build single sign-on capabilities
with APIs, because API authentication depends on the server, which
needs to hold the user's credentials even though these credentials
may not be shareable across different administrative boundaries.
Even more important than the above issues is that API orchestration
is generally unaware of network topology. When orchestrating a
distributed system it is very important to know the topology. For
instance, if a VM is being allocated, bandwidth may need to be
reserved on the path. Likewise, if a VM is being moved, appropriate
policies like QoS and security need to be dragged along. Firewall
rules may need to be installed in the path to the VM. In case of
disaster recovery, it is important to know which paths packets will
take to the new destination. All these things require a view of the
network topology, both logical and physical. It isn't enough to
know the IP addresses of the various devices; the paths must also be
known.
5.6.2. Network Protocol Approach
Network topology is known in the network. A close coupling between
the network state and the orchestration is needed for effective
orchestration. A significant portion of orchestration is making the
decision about the location of a service based on whether capacity is
available. This includes compute, network, storage, security, etc.
Orchestration across these multiple domains cannot be done without a
good knowledge of network topology. A close coupling between network
and orchestration is also needed to debug performance issues, or when
services aren't being created in the desired manner.
This close coupling between network and orchestration is easily
achieved if the orchestration is embedded in the network because then
it can easily access the network state such as the location of
devices, the shortest paths, bandwidth availability, etc.
To achieve this, a standard protocol is needed to orchestrate multi-
domain services. This protocol can be used by all existing APIs or
even new ones. The protocol will represent the network view of
information while APIs represent the application view. Protocols have
always been used in the Internet for interoperability. Using such
protocols it would be possible to interoperate currently incompatible
APIs. For instance, different APIs could be used in private and
public domain as long as they exchange information using a common
protocol. Protocols also facilitate easy discovery using mechanisms
such as broadcast and multicast, reducing the configuration overhead.
6. Recommendations
Based on the discussion above, the scaling properties of various
mobility solutions are listed below. There are four types of scaling
issues discussed so far: (a) datacenter access, (b) datacenter
interconnect, (c) datacenter-internet, and (d) datacenter core.
These functions can be combined in the same network device, or may be
kept separate. Logical separation allows for a clearer discussion of
the scaling attributes of these functions. The reasoning for
keeping them separate is described in greater detail below.
The table below summarizes the host-route issues in the various
scenarios at the various points in the network.
+-------------------------------------------------------------------+
|        Switch Scaling Requirements for Datacenter Mobility         |
+-------------------------------------------------------------------+
| Approach        | Access   | Core     | Interconnect  | Internet  |
+=================+==========+==========+===============+===========+
| Vanilla L2      | HIGH     | MASSIVE  | MASSIVE       | HIGH      |
+-----------------+----------+----------+---------------+-----------+
| L2/L3 Encap     | HIGH     | LOW      | MASSIVE       | HIGH      |
| w/ separate     |          |          |               |           |
| DC and Inter-DC |          |          |               |           |
| approaches      |          |          |               |           |
+-----------------+----------+----------+---------------+-----------+
| L2/L3 Encap     | HIGH     | LOW      | LOW           | HIGH      |
| w/ identical    |          |          |               |           |
| DC and Inter-DC |          |          |               |           |
| approaches      |          |          |               |           |
+-----------------+----------+----------+---------------+-----------+
| Hierarchical    | LOW      | LOW      | LOW           | HIGH      |
| MAC addressing  |          |          |               |           |
+-----------------+----------+----------+---------------+-----------+

        Table-1: Scaling Comparison of Datacenter Approaches
From the above, we can see that hierarchical MAC addressing fares
better than all other approaches. The only place it has a high
scaling need is at the datacenter-internet boundary. This issue can
be addressed by distributing the entries over multiple core
switches, since the boundary only involves north-south traffic and
does not need ECMP.
Based on this analysis, the following conclusions can be arrived at,
as recommendations for further work:
- It is important to distinguish the datacenter interconnect
boundary, the datacenter-internet boundary, the datacenter core and
the access from a scaling perspective. This is because private
addresses can be advertised between datacenters, but they can't be
advertised into the internet. At the internet boundary north-south
traffic is required, while at the core east-west traffic is required.
- The technology within and between datacenters should be identical.
This allows us to treat datacenter interconnects similar to the
datacenter core and interconnects can be scaled easily using
common techniques. Interconnects can use MPLS VPNs and a cloud can
be treated as a new "site" for private networks.
- Hierarchical MACs offer the best scaling and mobility properties.
They will lead to the most scalable network designs. The scaling
properties are particularly important at access because of the
huge number of access devices in the datacenter.
- Hierarchical MAC assignments could be manual or could be done
automatically using a new protocol. The new protocol could include
just switch/router level or even host level assignments.
- Hierarchical MACs (when combined with DCB) can also be used to
consolidate TCP/IP, Storage and IPC traffic over Ethernet. If DCB
is not available, then iSCSI and iWARP can be used over L2
forwarding. This affords the best scaling properties in the
interim. Over time, when DCB is available, datacenters can move to
consolidating FC and IB traffic over Ethernet.
- A hybrid discovery approach that separates host and network
address discovery needs to be used to maintain network resiliency.
Routing protocols will do network discovery while ARP should be
used for host location discovery. This gives the best results for
both the forwarding and control plane scale.
- ARP scaling is a control plane scaling issue and should be
addressed through central registries. A new protocol is required
to interact with the registry. This protocol must have mechanisms
to query and update the registry. This protocol must also support
installing temporary redirects (can be done through updates).
- Segmentation must involve an identifier orthogonal to the VLAN
tag, because VLAN tags can easily overlap across boundaries. Given
the use of L2 networks, the tag should sit just above the Ethernet
layer. MPLS is a layer 2.5 technology that can be used. Note that
using these tags does not require label switching inside the
datacenter, because packets will still be forwarded using MAC
addresses. MPLS tags will only identify the various tenants, and are
to be treated just like VLAN tags, although in a separate space. The
full VLAN range (including Q-in-Q) will be available for each
tenant. MPLS already segments customers in the Internet.
- Cloud control needs a protocol that runs parallel to other network
protocols to facilitate discovery through broadcast or multicast. A
close coupling between the orchestration and networking functions
can be achieved if this protocol runs in the network. This does not
hinder the use of a variety of API formats, but it provides
mechanisms for better intelligence in orchestration.
7. Network Architecture
This section is illustrative only. We have already shown that the
different datacenter functions (access, core, interconnect and
internet boundary) have different scaling properties, with different
types of datacenter approaches. This section shows how these
functions can be integrated together. Treating these functions
separately allows independent assessment of scale needs.
+--------+   +--------+   +--------+   +--------+
|  Core  |   |  Core  |   |  Core  |   |  Core  |
+--------+   +--------+   +--------+   +--------+
               ....................
                    ECMP Mesh
               ....................
+------+ +------+ +----+ +----+ +----+ +----+ +------+ +------+
| DC-I | | DC-I | | AC | | AC | | AC | | AC | | L3-I | | L3-I |
+------+ +------+ +----+ +----+ +----+ +----+ +------+ +------+

        Figure-1: Illustrative Network Architecture
In the above picture, "Core" represents the datacenter core with
links to all DC-I, L3-I and AC. This allows any to any connectivity
between Access, Interconnect and Internet boundaries. "DC-I" is the
Datacenter Interconnect between various datacenters. "AC" represents
all the access switches. An aggregation layer is not shown, but
could be present depending on the scaling needs. "L3-I" represents
the L3 Internet termination at the datacenter boundary.
Note that a large datacenter will have several thousand Access
switches and a few dozen Core switches. The number of L3-I switches
depends on the extent of traffic the network exchanges with the
Internet. If this were an HPC cloud, the Internet traffic would be
very small. If this were a Web 2.0 cloud, the Internet traffic would
be a higher percentage of the total traffic. If this were a hosted
public cloud with small and medium sized applications, most of the
traffic would be north-south and concentrated at L3-I. Accordingly,
the L3-I function needs to be scaled independently.
Similarly, the extent of the DC-I function depends on the number of
datacenters being connected and the inter-datacenter traffic. In case
of extensive site-to-site mobility or in the case of hybrid cloud,
this function would be heavily loaded. If there is no site-to-site
mobility or no hybrid clouds, the traffic here would be low.
8. Security Considerations
NA
9. IANA Considerations
NA
10. Conclusions
This document analyzed multiple approaches that can be adopted for
addressing datacenter issues and made recommendations on a
consistent approach. These recommendations can be used to further
discuss and/or develop solutions to the cloud datacenter problems in
a holistic manner.
11. References
11.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
11.2. Informative References
[REQ] Datacenter Network and Operations Requirements
http://www.ietf.org/id/draft-dalela-dc-requirements-00.txt
12. Acknowledgments
This document was prepared using 2-Word-v2.0.template.dot.
Authors' Addresses
Ashish Dalela
Cisco Systems
Cessna Business Park
Bangalore
India 560037
Email: adalela@cisco.com